Skip to content

serve: bound /usage per provider so one slow provider can't stall the response#1748

Merged
steipete merged 4 commits into
steipete:mainfrom
enieuwy:fix/serve-usage-per-provider-timeout
Jun 27, 2026
Merged

serve: bound /usage per provider so one slow provider can't stall the response#1748
steipete merged 4 commits into
steipete:mainfrom
enieuwy:fix/serve-usage-per-provider-timeout

Conversation

@enieuwy

@enieuwy enieuwy commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Problem

codexbar serve's /usage handler (serveUsage) collected providers in a sequential loop with no per-provider timeout. A single slow or hung provider (e.g. a CLI/web fetch that never returns) blocked the entire handler.

The only backstop was the outer request deadline (serveResponseWithDeadline, default 30s), which returns a 504 with an empty body and discards every provider already collected. Worse, the last-known-good merge (CLIServeResponseCache.mergeLastGoodUsageItems) only runs when response.status == .ok, so the 504 path produces zero data — no fresh providers, no cached fallback.

Net effect: one stuck provider makes the whole endpoint return nothing, which pushes shell/Zellij consumers (e.g. showy-quota) onto degraded CLI fallback. Observed live: a wedged serve held port 8080 with /health returning ok while /usage timed out for the full 30s.

Fix

  • Collect providers concurrently, bounding each with BoundedTaskJoin at a budget strictly below the outer request deadline (serveProviderTimeout(requestTimeout:) = requestTimeout * 0.8, or 25s when the deadline is disabled/non-finite).
  • A provider that exceeds its budget contributes its own provider error row instead of blocking the others, so the response still renders every healthy provider with fresh data.
  • Each provider's timeout clock starts when its task is spawned, so a hung provider can't serialize the others' deadlines.
  • Results merge in caller-provided provider order regardless of completion order.
  • The serve usage context's webTimeout is aligned to the per-provider budget (was a fixed 60s that exceeded the 30s request deadline).
  • The outer request deadline is retained as a last-resort safety net.

Last-known-good behavior (scope clarification)

Per-account error rows that carry a cache key are still merged with last-known-good by CLIServeResponseCache exactly as before. A timeout row is account-agnostic (no cacheAccountKey) and is intentionally not reconstructed from last-good — this matches the existing cache rule that "a timeout cannot prove which account is currently active" (staleResponse already refuses usage: keys, and tests assert unkeyed timeout rows are not reconstructed). This PR's win is that a hung provider no longer takes the healthy providers down with it; it does not claim to restore the hung provider's own last value.

Real behavior proof

Built the debug CLI and ran the fixed codexbar serve on a spare port (8092, leaving the app's 8080 untouched). On this machine claude and cursor are genuinely wedged — the same providers that triggered the original incident.

Normal --request-timeout 30 (per-provider budget 24s): /usage returns in 25.3s with the healthy providers intact:

codex        DATA   source=oauth
claude       ERROR  msg="claude usage timed out" kind=provider
cursor       ERROR  msg="cursor usage timed out" kind=provider
antigravity  DATA   source=cli

Pre-fix, the claude/cursor hang would have produced an empty 504 and lost codex + antigravity too. Post-fix they render fresh.

Small --request-timeout 2 (per-provider budget 1.6s): /usage returns in 1.70s — proving the per-provider budget, not the outer deadline, is the cutoff:

codex        DATA   source=oauth
claude       ERROR  msg="claude usage timed out" kind=provider
cursor       ERROR  msg="cursor usage timed out" kind=provider
antigravity  ERROR  msg="antigravity usage timed out" kind=provider

(elapsed measured with curl; provider values/account labels redacted.)

Tests

CLIServeRouterTests:

  • serveProviderTimeout budget mapping, including the strict-below-deadline invariant for sub-second timeouts and the disabled-deadline case.
  • A hung provider (30s simulated) with a 0.1s budget: collection returns in ~0.1s, healthy providers render in caller order, and the hung one yields a single provider error row that is account-agnostic (cacheAccountKey == nil), documenting the deliberate non-reconstruction.

Verified locally: swift build clean (StrictConcurrency), swift test --filter CLIServeRouterTests green, make check (SwiftFormat + SwiftLint) 0 violations.

@clawsweeper

clawsweeper Bot commented Jun 25, 2026

Copy link
Copy Markdown

Codex review: needs maintainer review before merge. Reviewed June 26, 2026, 10:29 PM ET / 02:29 UTC.

Summary
The PR changes codexbar serve /usage to collect providers concurrently with a finite per-provider timeout, preserve disabled request-timeout semantics, and add focused router tests.

Reproducibility: yes. source-reproducible: current main loops selected /usage providers sequentially and only the outer deadline can stop a hung provider. The PR body also includes live after-fix output from a wedged-provider setup; I did not run live provider probes in this read-only review.

Review metrics: 2 noteworthy metrics.

  • Changed surface: 2 files changed; 178 additions, 9 deletions. The diff is focused on CLI serve behavior and matching router tests.
  • Focused tests: 3 tests added. The new tests cover timeout budgeting, hung-provider behavior, and disabled-deadline semantics.

Merge readiness
Overall: 🦞 diamond lobster
Proof: 🦞 diamond lobster
Patch quality: 🦞 diamond lobster
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Risk before merge

  • [P1] GitHub status showed the macOS test shards still in progress at inspection time, so maintainers should let required checks settle before merge.

Maintainer options:

  1. Decide the mitigation before merge
    Land the bounded per-provider /usage timeout with the latest compatibility fixups once required checks and maintainer review are satisfied.
  2. Pause or close
    Do not merge this PR until maintainers decide whether the risk is worth taking.

Next step before merge

  • No automated repair is needed; this PR is a normal maintainer review and merge candidate after required checks complete.

Security
Cleared: No concrete security or supply-chain concern was found; the diff changes Swift CLI serve logic and focused tests only.

Review details

Best possible solution:

Land the bounded per-provider /usage timeout with the latest compatibility fixups once required checks and maintainer review are satisfied.

Do we have a high-confidence way to reproduce the issue?

Yes, source-reproducible: current main loops selected /usage providers sequentially and only the outer deadline can stop a hung provider. The PR body also includes live after-fix output from a wedged-provider setup; I did not run live provider probes in this read-only review.

Is this the best way to solve the issue?

Yes. Concurrent per-provider collection with BoundedTaskJoin is narrow, preserves --request-timeout 0, and adds focused coverage; I found no existing supported path that already solves this behavior.

AGENTS.md: found and applied where relevant.

Codex review notes: model internal, reasoning high; reviewed against a064edab3e47.

Label changes

Label changes:

  • remove merge-risk: 🚨 compatibility: Current PR review selected no merge-risk labels.

Label justifications:

  • P2: This is a normal-priority CLI server availability fix with limited blast radius and no blocking defect found.
  • rating: 🦞 diamond lobster: Overall readiness is 🦞 diamond lobster; proof is 🦞 diamond lobster and patch quality is 🦞 diamond lobster.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (live_output): The PR body includes copied live /usage output from a real wedged-provider setup showing healthy providers returned and slow providers became timeout rows after the fix.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes copied live /usage output from a real wedged-provider setup showing healthy providers returned and slow providers became timeout rows after the fix.
Evidence reviewed

What I checked:

  • AGENTS.md read fully: Repository guidance was read fully; CLI-focused validation guidance, no live Keychain-prompt checks, and Swift concurrency caution were relevant to this review. (AGENTS.md:1, a064edab3e47)
  • Current main still has sequential provider collection: On current main, serveUsage builds one usage context and loops through selection.asList sequentially before returning /usage, so one hung provider can still block the handler until the outer deadline. (Sources/CodexBarCLI/CLIServeCommand.swift:734, a064edab3e47)
  • Documented request-timeout contract: The CLI docs say --request-timeout 0 keeps waiting indefinitely, which the latest PR head now preserves by returning no provider join bound for disabled deadlines. (docs/cli.md:53, a064edab3e47)
  • PR implementation bounds finite provider work: The PR computes serveProviderTimeout, aligns webTimeout, collects providers with serveCollectUsageOutputs, and returns an account-agnostic provider timeout row when a provider exceeds its budget. (Sources/CodexBarCLI/CLIServeCommand.swift:724, 2c3454f1a2bb)
  • Focused regression coverage: The PR adds tests for finite timeout budgeting, oversized timeout clamping, hung-provider collection, timeout row cache-key behavior, and disabled-deadline collection. (Tests/CodexBarTests/CLIServeRouterTests.swift:229, 2c3454f1a2bb)
  • Timeout helper cancels the source task: BoundedTaskJoin resolves .timedOut through cancelSource: true, cancels the source task, and resumes the waiting continuation, matching the PR's intended per-provider cutoff. (Sources/CodexBarCore/BoundedTaskJoin.swift:21, a064edab3e47)

Likely related people:

  • enieuwy: Prior merged history includes serve request deadlines/cache coalescing, serve config reload, and Antigravity serve behavior in the same CLI/router area. (role: feature-history owner; confidence: high; commits: 06b7de126f1a, 894bbcb8c20b, cd201b9e5035; files: Sources/CodexBarCLI/CLIServeCommand.swift, Tests/CodexBarTests/CLIServeRouterTests.swift, docs/cli.md)
  • steipete: Current main blame ties the serve deadline/cache code and BoundedTaskJoin helper to this person, and the latest PR head includes their timeout compatibility fixups. (role: recent area contributor and fixup author; confidence: high; commits: f380287041b8, ac8159f6b10c, a3134da7a536; files: Sources/CodexBarCLI/CLIServeCommand.swift, Sources/CodexBarCore/BoundedTaskJoin.swift, Tests/CodexBarTests/CLIServeRouterTests.swift)
  • ThiagoCAltoe: The original localhost serve command, /usage route surface, router tests, and CLI docs entered through the merged serve-command history. (role: introduced behavior; confidence: medium; commits: 74a019c4aa65; files: Sources/CodexBarCLI/CLIServeCommand.swift, Sources/CodexBarCLI/CLILocalHTTPServer.swift, Tests/CodexBarTests/CLIServeRouterTests.swift)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 52c2585b53

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +826 to +829
output.payload.append(Self.makeProviderErrorPayload(
provider: provider,
account: nil,
source: "auto",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve cache keys for provider timeout rows

When an identified provider (for example Codex with visible accounts or any token-account provider) has a previous good /usage response and this new provider-level timeout fires, this synthetic error row is emitted with account: nil and no cacheAccountKey. mergeLastGoodUsageItems only restores cached usage when it can build the (provider, accountID) key from usageCacheKeys, so these timeout rows never match the saved good rows and the endpoint returns the raw timeout error instead of the intended last-known-good fallback.

Useful? React with 👍 / 👎.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P2 Normal priority bug or improvement with limited blast radius. labels Jun 25, 2026
… response

`serveUsage` collected providers in a sequential loop with no per-provider
timeout, so a single slow or hung provider (e.g. a CLI/web fetch that never
returns) blocked the whole `/usage` handler. The only backstop was the outer
request deadline, which returns a 504 with an empty body and discards every
provider already collected — and because that 504 is not `.ok`, the
last-known-good merge (`mergeLastGoodUsageItems`, which requires `.ok`) never
ran. Net effect: one stuck provider made the entire endpoint return nothing,
which pushed shell/Zellij consumers onto degraded CLI fallback.

Collect providers concurrently, bounding each with `BoundedTaskJoin` at a
budget strictly below the outer request deadline (`serveProviderTimeout`).
A provider over budget now contributes a provider error row instead of
blocking the others, so the response stays `.ok`, the cache can restore that
row from last-known-good, and every healthy provider still renders. Each
provider's timeout clock starts when its task is spawned, so a hung provider
cannot serialize the others' deadlines. Results merge in caller-provided
provider order regardless of completion order. The serve usage context's
`webTimeout` is aligned to the per-provider budget (was a fixed 60s that
exceeded the 30s request deadline).

Adds CLIServeRouterTests coverage for the timeout budget and for a hung
provider degrading to an error row without blocking siblings.
@enieuwy enieuwy force-pushed the fix/serve-usage-per-provider-timeout branch from 52c2585 to 94a53fb Compare June 25, 2026 06:32
@enieuwy

enieuwy commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

Thanks both — addressed in 94a53fb3.

On the cache-key finding (Codex P2 / ClawSweeper P2): correct, and the PR body over-claimed. I took the "explicitly narrow the behavior and tests" option rather than keying timeout rows. Reconstructing a timeout row from last-good would require re-resolving accounts (including Codex's visibleCodexAccounts()) outside the per-provider budget — reintroducing the blocking call this PR removes — and it contradicts the existing rule that "a timeout cannot prove which account is currently active" (staleResponse already refuses usage: keys, and the current tests assert unkeyed timeout rows are not reconstructed). So a timeout row is intentionally account-agnostic and not restored; per-account error rows that carry a cache key still merge with last-good exactly as before. The code comments, CHANGELOG, and PR body are corrected, and the test now asserts the timeout row has cacheAccountKey == nil / account == nil.

On real behavior proof (P1): added to the PR body. Ran the fixed codexbar serve on a spare port; on this machine claude/cursor are genuinely wedged. With --request-timeout 30 (24s budget) /usage returns in 25.3s with codex + antigravity rendering real data and claude/cursor as timeout error rows — pre-fix that hang produced an empty 504 that lost the healthy providers too. With --request-timeout 2 (1.6s budget) /usage returns in 1.70s, proving the per-provider budget is the cutoff.

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented Jun 25, 2026

Copy link
Copy Markdown

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

@clawsweeper clawsweeper Bot added proof: sufficient Contributor real behavior proof is sufficient. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. labels Jun 25, 2026
@steipete

Copy link
Copy Markdown
Owner

@clawsweeper re-review

Exact head: 2c3454f1a2bb0143ca0488f919879f3e9d3bc226.

Synced current main and preserved the documented --request-timeout 0 contract: it adds no serve-level provider join bound, while finite deadlines still isolate hung providers, share the outer 24-hour cap, and reject non-finite input. The focused 41-test serve suite, make check, the full 43-shard suite, and local plus whole-branch autoreview pass.

@clawsweeper

clawsweeper Bot commented Jun 27, 2026

Copy link
Copy Markdown

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

@clawsweeper clawsweeper Bot added rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. labels Jun 27, 2026
@steipete steipete merged commit 240f429 into steipete:main Jun 27, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

P2 Normal priority bug or improvement with limited blast radius. proof: sufficient Contributor real behavior proof is sufficient. rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants